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Abstract 

Both linear mixed models (LMMs) and sparse regression models are widely used in genetics applica- 
tions, including, recently, polygenic modeling in genome-wide association studies. These two approaches 
make very different assumptions, so are expected to perform well in different situations. However, in 
practice, for a given data set one typically does not know which assumptions will be more accurate. 
Motivated by this, we consider a hybrid of the two, which we refer to as a "Bayesian sparse linear mixed 
model" (BSLMM) that includes both these models as special cases. We address several key computa- 
tional and statistical issues that arise when applying BSLMM, including appropriate prior specification 
for the hyper-parameters, and a novel Markov chain Monte Carlo algorithm for posterior inference. We 
apply BSLMM and compare it with other methods for two polygenic modeling applications: estimat- 
ing the proportion of variance in phenotypes explained (PVE) by available genotypes, and phenotype 
(or breeding value) prediction. For PVE estimation, we demonstrate that BSLMM combines the ad- 
vantages of both standard LMMs and sparse regression modeling. For phenotype prediction it con- 
siderably outperforms cither of the other two methods, as well as several other large-scale regression 
methods previously suggested for this problem. Software implementing our method is freely available 



from http : / / stephenslab . uchicago . edu/ software . html 



Author Summary 

The goal of polygenic modeling is to better understand the relationship between genetic variation and 
variation in observed characteristics, including variation in quantitative traits (e.g. cholesterol level in 
humans, milk production in cattle) and disease susceptibility. Improvements in polygenic modeling will 
help improve our understanding to this relationship, and could ultimately lead to, for example, changes 
in clinical practice in humans, or better breeding/mating strategies in agricultural programs. Polygenic 
models present important challenges, both at the modeling/statistical level (what modeling assumptions 
produce the best results), and at the computational level (how should these models be effectively fit to 
data). We develop novel approaches to help tackle both these challenges, and demonstrate the gains in 
accuracy that result on both simulated and real data examples. 
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Introduction 

Both linear mixed models (LMMs) and sparse regression models are widely used in genetics applications. 
For example, LMMs are often used to control for population stratification, individual relatedness, or 
unmeasured confounding factors when performing association tests in genetic association studies [THS] 
and gene expression studies [T0HT2"] . They have also been used in genetic association studies to jointly 
analyze groups of SNPs |13U14j . Similarly, sparse regression models have been used in genome- wide asso- 
ciation analyses [T5H2TJ] and in expression QTL analysis [3T]. Further, both LMMs and sparse regression 
models have been applied to, and garnered renewed interest in, polygenic modeling in association stud- 
ies. Here, by polygenic modeling we mean any attempt to relate phenotypic variation to many genetic 
variants simultaneously (in contrast to single-SNP tests of association). The particular polygenic mod- 
cling problems that wc focus on here arc estimating "chip hcritability" , being the proportion of variance 
in phenotypes explained (PVE) by available genotypes [T!?I[2"2"H2~1] . and predicting phenotypes based on 
genotypes [2"5H2"9"] . 

Despite the considerable overlap in their applications, in the context of polygenic modeling, LMMs and 
sparse regression models are based on almost diametrically opposed assumptions. Precisely, applications 
of LMMs to polygenic modeling (e.g. [22]) effectively assume that every genetic variant affects the pheno- 
type, with effect sizes normally distributed, whereas sparse regression models, such as Bayesian variable 
selection regression models (B VSR) [T8HT9] , assume that a relatively small proportion of all variants affect 
the phenotype. The relative performance of these two models for polygenic modeling applications would 
therefore be expected to vary depending on the true underlying genetic architecture of the phenotype. 
However, in practice, one does not know the true genetic architecture, so it is unclear which of the two 
models to prefer. Motivated by this observation, we consider a hybrid of these two models, which we 
refer to as the "Bayesian sparse linear mixed model" , or BSLMM. This hybrid includes both the LMM 
and a sparse regression model, BVSR, as special cases, and is to some extent capable of learning the 
genetic architecture from the data, yielding good performance across a wide range of scenarios. By being 
"adaptive" to the data in this way, our approach obviates the need to choose one model over the other, 
and attempts to combine the benefits of both. 

The idea of a hybrid between LMM and sparse regression models is, in itself, not new. Indeed, models 
like these have been used in breeding value prediction to assist genomic selection in animal and plant 
breeding programs [30H35] . gene selection in expression analysis while controlling for batch effects [36] . 
phenotype prediction of complex traits in model organisms and dairy cattle |37H40j . and more recently, 
mapping complex traits by jointly modeling all SNPs in structured populations [41]. Compared with these 
previous papers, our work makes two key contributions. First, we consider in detail the specification of 
appropriate prior distributions for the hyper-parameters of the model. We particularly emphasize the 
benefits of estimating the hyper-parameters from the data, rather than fixing them to pre-specified values 
to achieve the adaptive behavior mentioned above. Second, we provide a novel computational algorithm 
that exploits a recently described linear algebra trick for LMMs [8j|9] . The resulting algorithm avoids ad 
hoc approximations that are sometimes made when fitting these types of model (e.g. [371I4T] ), and yields 
reliable results for data sets containing thousands of individuals and hundreds of thousands of markers. 
(Most previous applications of this kind of model involved much smaller data sets.) 

Since BSLMM is a hybrid of two widely used models, it naturally has a wide range of potential uses. 
Here we focus on its application to polygenic modeling for genome-wide association studies, specifically 
two applications of particular interest and importance: PVE estimation (or "chip hcritability" estima- 
tion) and phenotype prediction. Estimating the PVE from large-scale genotyped marker sets (e.g. all 
the SNPs on a genome-wide genotyping chip) has the potential to shed light on sources of "missing her- 
itability" [42] and the underlying genetic architecture of common diseases [19,22 24|[43]. And accurate 
prediction of complex phenotypes from genotypes could ultimately impact many areas of genetics, in- 
cluding applications in animal breeding, medicine and forensics [271-29, 37 40 . Through simulations and 
applications to real data, we show that BSLMM successfully combines the advantages of both LMMs and 
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sparse regression, is robust to a variety of settings in PVE estimation, and outperforms both models, and 
several related models, in phenotype prediction. 
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Methods 

Background and Motivation 

We begin by considering a simple linear model relating phcnotypes y to genotypes X: 

y = l nj Lt + X/3 + e, (1) 
e ~ MVN n (0, I n r _1 ). (2) 

Here y is an n- vector of phcnotypes measured on n individuals, X is an n x p matrix of genotypes 
measured on the same individuals at p genetic markers, (3 is a p-vector of (unknown) genetic marker 
effects, 1„ is an n-vector of Is, fi is a scalar representing the phcnotype mean, and e is an n-vector of 
error terms that have variance r . MVN„ denotes the n-dimcnsional multivariate normal distribution. 
Note that there are many ways in which measured genotypes can be encoded in the matrix X. We assume 
throughout this paper that the genotypes are coded as 0, 1 or 2 copies of a reference allele at each marker, 
and that the columns of X are centered but not standardized; sec Text SI Detailed Methods. 

A key issue is that, in typical current datasets (e.g. GWAS), the number of markers p is much 
larger than the number of individuals n. As a result, parameters of interest (e.g. f3 or PVE) cannot 
be estimated accurately without making some kind of modeling assumptions. Indeed, many existing 
approaches to polygenic modeling can be derived from ([T]) by making specific assumptions about the 
genetic effects j3. For example, the LMM approach from [22], which has recently become commonly 
used for PVE estimation (e.g. [241I44TI46] ) . is equivalent to the assumption that effect sizes are normally 
distributed, such that 

ft~N(0,a b 2 /(pr)). (3) 

[Specifically, exact equivalence holds when the relatedness matrix in the LMM is computed from the 
genotypes as K = -XX. T (e.g. [47j). |22| use a matrix in this form, with X centered and standardized, 
and with a slight modification of the diagonal elements.] For brevity, in this paper we refer to the 
regression model that results from this assumption as the LMM (note that this is relatively restrictive 
compared with the usual definition) ; it is also commonly referred to as "ridge regression" in statistics |48j . 
The estimated combined effects (X/3), or equivalently, the estimated random effects, obtained from this 
model are commonly referred to as Best Linear Unbiased Predictors (BLUP) [49] . 

An alternative assumption, which has also been widely used in polygenic modeling applications [18[ 
I19[I34|. and more generally in statistics for sparse high-dimensional regression with large numbers of 
covariates |50[l51j. is that the effects come from a mixture of a normal distribution and a point mass at 
0, also known as a point- normal distribution: 

ft~7rN(0,^/(pr)) + (l-7r)<5 , (4) 

where tt is the proportion of non-zero /3 and So denotes a point mass at zero. [This definition of ir 
follows the convention from statistics |19U501f5T] . which is opposite to the convention in animal breeding 
[2"Tll3"!?H3~iII40| .] We refer to the resulting regression model as Bayesian Variable Selection Regression 
(BVSR), because it is commonly used to select the relevant variables (i.e. those with non-zero effect) for 
phcnotype prediction. Although Q formally includes ([3]) as a special case when it = 1, in practice @ 
is often used together with an assumption that only a small proportion of the variables are likely to be 
relevant for phenotype prediction, say by specifying a prior distribution for ir that puts appreciable mass 
on small values (e.g. pj5]). In this case, BVSR and LMM can be viewed as making almost diametrically 
opposed assumptions: the LMM assumes every variant has an effect, whereas BVSR assumes that a very 
small proportion of variants have an effect. (In practice, the estimate of <Jb under LMM is often smaller 
than the estimate of a a under BVSR, so we can interpret the LMM as assuming a large number of small 
effects, and BVSR as assuming a small number of larger effects.) 
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A more general assumption, which includes both the above as special cases, is that the effects come 
from a mixture of two normal distributions: 

A ~ ttN(0, {al + al)/{jpr)) + (1 - 7r)N(0, a 2 b /(pr)), (5) 

Setting 7r = yields the LMM ([3]), and <7& = yields BVSR (TJ}. we can interpret this model as assuming 
that all variants have at least a small effect, which are normally distributed with variance a%/(pr), and 
some proportion (it) of variants have an additional effect, normally distributed with variance a^/(pr). 
The earliest use of a mixture of two normal distributions for the regression coefficients that we are aware 
of is [52], although in that paper various hyper-parameters were fixed, and so it did not include LMM 
and BVSR as special cases. 

Of the three assumptions on the effect size distributions above, the last ([5]) is clearly the most flexible 
since it includes the others as special cases. Obviously other assumptions are possible, some still more 
flexible than the mixture of two normals: for example, a mixture of three or more normals. Indeed, many 
other assumptions have been proposed, including variants in which a normal distribution is replaced by 
a t distribution. These variants include the "Bayesian alphabet models" - so-called simply because they 
have been given names such as BayesA, BayesB, BayesC, etc. - that have been proposed for polygenic 
modeling, particularly breeding value prediction in genomic selection studies. Table [1] summarizes these 
models, and some other effect size distributions that have been proposed, together with relevant references 
(see also [53] and the references there in). Among these, the models most closely related to ours are 
BayesC7r [34] and BayesR [35]. Specifically, BayesC7r without a random effect is BVSR, and with a 
random effect is BSLMM (which we define below). BayesR models effect sizes using a mixture of three 
normal components plus a point mass at zero, although the relative variance for each normal distribution 
is fixed. 

Given the wide range of assumptions for effect size distributions that have been proposed, it is natural 
to wonder which are the most appropriate for general use. However, answering this question is complicated 
by the fact that even given the effect size distribution there are a number of different ways that these 
models can be implemented in practice, both in terms of statistical issues, such as treatment of the 
hyper-parameters, and in terms of computational and algorithmic issues. Both these types of issues can 
greatly affect practical performance. For example, many approaches fix the hyper-parameters to specific 
values [23 32 , 33 , 40] which makes them less flexible [341154] , Thus, in this paper we focus on a particular 
effect size distribution ([5]), which while not the most flexible among all that could be considered, is 
certainly more flexible than the one that has been most used in practice for estimating PVE (LMM), 
and admits certain computational methods that could not be applied in all cases. We consider in detail 
how to apply this model in practice, and the resulting advantages over LMM and BVSR (although 
we also compare with some other existing approaches). A key contribution of this paper is to provide 
new approaches to address two important practical issues: the statistical issue of how to deal with the 
unknown hyper-parameters (tt, er a , tr^) , and the computational issue of how to fit the model. Notably, 
with the computational tools we use here, fitting the model ([5]) becomes, for a typical data set, less 
computationally intensive than fitting BVSR, as well as providing gains in performance. 

With this background, we now turn to detailed description of the model, its prior specification and 
its computation algorithm. 

The Bayesian Sparse Linear Mixed Model 

In this paper we focus on the simple linear model ([T]) with mixture prior (f5]) on the effects. However, the 
computational and statistical methods we use here also apply to a more general model, which we refer to 
as the Bayesian Sparse Linear Mixed Model (BSLMM), and which includes the model (fT]) with ([5]) as a 
special case. 
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The BSLMM consists of a standard linear mixed model, with one random effect term, and with 
sparsity inducing priors on the regression coefficients: 



where u is an n-vector of random effects with known n x n covariance matrix K. In referring to u as the 
"random effects" we are following standard terminology from LMMs. Standard terminology also refers to 
the coefficients f3 as "fixed effects" , but this phrase has a range of potential meanings [55] and so we avoid 
it here. Instead we use the term "sparse effects" for these parameters to emphasize the sparsity-inducing 
prior. 

It is straightforward to show that when K = iXX r , BSLMM is equivalent to the simple linear model 
([I]) with mixture prior ([5J on the effects. However, our discussion of prior specification, computational 
algorithms, and software, all apply for any K. 

When we say that ^ is equivalent to ([I} with ([S]), this equivalence refers to the implied probability 
model for y given X and the hyper-parameters fi, r, n, a a , a^. However, (3 and (3 are not equivalent 
(explaining our use of two different symbols): in ([6]) the random effect u captures the combined small 
effects of all markers, whereas in (Q]) these small effects are included in (3. Since both our applications 
focus on the relationship between y and X, and not on interpreting estimates of (3 or 0, we do not concern 
ourselves any further with this issue, although it may need consideration in applications where individual 
estimated genetic effects are of more direct interest (e.g. genetic association mapping). A related issue 
is the interpretation of the random effect u in BSLMM: from the way we have presented the material u 
is most naturally interpreted as representing a polygenic component, specifically the combined effect of 
a large number of small effects across all measured markers. However, if there are environmental factors 
that influence the phenotype and are correlated with genotype (e.g. due to population structure), then 
these would certainly affect estimates of u, and consequently also affect estimates of other quantities, 
including the PVE. In addition, phenotype predictions from BSLMM will include a component due to 
unmeasured environmental factors that are correlated with measured genotypes. These issues are, of 
course, not unique to BSLMM - indeed, they apply equally to the LMM; see [56] and the response 
from [57) for relevant discussion. 

Finally, given the observation that a mixture of two normals is more flexible than a point-normal, it 
might seem natural to consider modifying ([6]) by making the assumption that (3 comes from a mixture 
of two normal distributions rather than a point-normal. However, if K = ^XX T then this modification 
is simply equivalent to changing the values of a a , Ob- 

Prior Specification 

The BSLMM involves (hyper-)parameters, h,t, it, a a , and ov Before considering prior specification for 
these parameters, we summarize their interpretations as follows: 

• /i and r _1 control the phenotype mean and residual variance. 

• 7r controls the proportion of (3 values in ^ that are non-zero. 

• a a controls the expected magnitude of the (non-zero) (3. 

• Oh controls the expected magnitude of the random effects u. 



fli ~ 7rN(0, cr^T -1 ) + (I — 7r)#o, 



y = l„^ + X/3 + u + e, 
u~MVN„(0, ( 7 6 2 r- 1 K), 
e~MVN rl (0,r- 1 I„), 



(6) 
(7) 
(8) 
(9) 
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Appropriate values for these parameters will clearly vary for different data sets, so it seems desirable 
to estimate them from the data. Here we accomplish this in a Baycsian framework by specifying prior 
distributions for the parameters, and using Markov chain Monte Carlo (MCMC) to obtain approximate 
samples from their posterior distribution given the observed data. Although one could imagine instead 
using maximum likelihood to estimate the parameters, the Bayesian framework has several advantages 
here: for example, it allows for incorporation of external information (e.g. that most genetic markers will, 
individually, have small effects), and it takes into account of uncertainty in parameter estimates when 
making other inferences (e.g. phenotype prediction). 

For the mean [i and the inverse of error variance, r, we use the standard conjugate prior distributions: 

T~Gamma((ti,K2), (10) 
H\t ~ N(0, er^r -1 ), (11) 

where k\ and K2 denote, respectively, shape and rate parameters of a Gamma distribution. Specifically we 
consider the posterior that arises in the limits k\ — > 0, ft 2 — > and er 2 — > oo. These limits correspond to 
improper priors, but the resulting posteriors are proper, and scale appropriately with shifting or scaling 
of the phenotype vector y |58j . In particular, these priors have the property that conclusions will be 
unaffected by changing the units of measurement of the phenotype, which seems desirable for a method 
intended for general application. 

Prior specification for the remaining hyper-parameters 7r,<7^,er^ is perhaps more important. Our 
approach is to extend the prior distributions for BVSR described in |19) . 

Following |19j we place a uniform prior on log7r: 

log(7r)~U(log(l/p),log(l)), (12) 

where p is total number of markers being analyzed. The upper and lower limits of this prior were chosen 
so that 7r (the expected proportion of markers with non-zero (3) ranges from 1/p to 1. A uniform prior on 
log 7r reflects the fact that uncertainty in tt in a typical GWAS will span orders of magnitude. A common 
alternative (see e.g. [T81I34]) is a uniform distribution on ir, but as noted in [19] this puts large weight on 
large numbers of markers having non-zero (3 (e.g. it would correspond to placing 50% prior probability 
to the event that more than half of the markers have non-zero /3, and correspond to placing 90% prior 
probability to the event that more than 10% of the markers have non-zero (3). 

To specify priors for a a and Cf,, we exploit the following idea from [19]: prior specification is easier 
if we first re-parameterize the model in terms of more interpretable quantities. Specifically we extend 
ideas from [19] to re-parameterize the model in terms of the (expected) proportion of phenotypic variance 
explained by the sparse effects and by the random effects. 

To this end, we define PVE (the total proportion of variance in phenotype explained by the sparse 
effects and random effects terms together) and PGE (the proportion of genetic variance explained by the 
sparse effects terms) as functions of /3, u and r 

PVE(j9,u,t) 

PGE(^u) 



V(X^ ± u L _ ; (13) 



V(X/3 + u) + t 

v(x3) 

V(X/3 + u) ' 



(14) 



where the function V(x) is defined as 

n 

V(x):=-5> 8 -x) 2 . (15) 



n 

i=l 
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These definitions ensure that both PVE and PGE must lie in the interval [0, 1]. PVE reflects how well 
one could predict phenotype y from the available SNPs if one knew the optimal (3 as well as the random 
effects u; together with PVE, PGE reflects how well one could predict phenotype y using alone. 

Since PVE and PGE are functions of (/3, u, r), whose distributions depend on hyper-parameters 
7r,er a ,<7b, the prior distribution for PVE and PGE depends on the priors assigned to these hyper- 
paramctcrs. In brief, our aim is to choose priors for the two hyper-parameters a 2 and a\ so that the 
induced priors on both PVE and PGE are roughly uniform on and 1. (Other distributions could be 
chosen if desired, but we consider this uniform distribution one reasonable default.) However, because 
the relationship between the distribution of PVE, PGE and the hyper-parameters is not simple, we have 
to make some approximations. 

Specifically, we introduce h, p as approximations (they are ratios of expectations rather than expec- 
tations of ratios) to the expectations of PVE and PGE, respectively: 

pirsgal + s b a 2 

pnsacrl + s b a 2 + 1 ' 

P^s a a 2 a 
P := 2~, 2 > 

where s a is the average variance of genotypes across markers, and Sb is the mean of diagonal elements in 
K. In other words, s a — ^ Y2i=i £j=i x ij an< ^ Sb ~ n S"=i ^U> where xtj and fc.y are the zjth elements 
of matrices X and K, respectively. See Text SI Detailed Methods for derivations. Intuitively, the term 
pirs a a 2 captures the expected genetic variance contributed by the sparse effects term (relative to the error 
variance), because pir is the expected number of causal markers, a 2 is the expected effect size variance of 
each causal marker (relative to the error variance), and s a is the average variance of marker genotypes. 
Similarly, the term Sbcr 2 captures the expected genetic variance contributed by the random effects term 
(relative to the error variance), because a 2 is the expected variance of the random effects (relative to the 
error variance) when the relatedness matrix has unit diagonal elements, while Sb properly scales it when 
not. 

The parameter p provides a natural bridge between the LMM and BVSR: when p = BSLMM 
becomes the LMM, and when p = 1 BSLMM becomes BVSR. In practice, when the data favors the 
LMM, the posterior distribution of p would mass near 0, and when the data favors BVSR, the posterior 
distribution of p would mass near 1. 

In summary, the above re-parameterizes the model in terms of (h,p,ir) instead of (<7 a , at, "")■ Now, 
instead of specifying prior distributions for er a , <jb, we rather specify prior distributions for h, p. Specifically 
we use uniform prior distributions for h, p: 

ft~U(0,l), (18) 
P~U(0,1), (19) 

independent of one another and of ir. Since h and p approximate PVE and PGE, these prior distributions 
should lead to reasonably uniform prior distributions for PVE and PGE, which we view as reasonable 
defaults for general use. (If one had specific information about PVE and PGE in a given application 
then this could be incorporated here.) In contrast it seems much harder to directly specify reasonable 
default priors for a a , a b (although these priors on h, p, 7r do of course imply priors for er a , a b ] see Text SI 
Detailed Methods). 

Note that we treat h and p as approximations to PVE and PGE only for prior specification; when 
estimating PVE and PGE from data we do so directly using their definitions (fT5|) and (see below 
for details). 



(16) 
(17) 
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Posterior Sampling Scheme 



7i ~ Bcrnoulli(7r) , 



(20) 
(21) 
(22) 



To facilitate computation, we use the widely-used approach from |52| of introducing a vector of binary 
indicators 7 = (71, • • • , 7 p ) € {0, 1} P that indicates whether the corresponding coefficients j3 are non-zero. 
The point-normal priors for j3 can then be written 



where 0^ denotes the sub- vector of /3 corresponding to the entries {i : 7^ = 1}; (3 denotes the sub- 
vector of (3 corresponding to the other entries, {i : 7,; = 0}; and | — y | denotes the number of non-zero 
entries in 7. We use MCMC to obtain posterior samples of (h, p, 7r,7) on the product space (0,1) x 
(0, 1) x (0, 1) x {0, which is given by 

P(&,p,7r >7 |y) cx P(y\h,p,ir,j)P(h)P(p)P(j\7r)P(n). (23) 

The marginal likelihood P(y\h, p, 7r, 7) can be computed analytically integrating out (/3,u,r); see below 
for further details. We use a Metropolis-Hastings algorithm to draw posterior samples from the above 
marginal distribution. In particular, we use a rank based proposal distribution for 7 |19j . which focus 
more of the computational time on examining SNPs with stronger marginal associations. 

We use the resulting sample from the posterior distribution (f2"3")l to estimate PVE and PGE as follows. 
For each sampled value of (h, p, 7r, 7), we sample a corresponding value for (r, /3, u) from the conditional 
distribution P(r, /3, u|y, h, p, ir, 7). We then use each sampled value of (r, /3, u) to compute a sampled 
value of PVE and PGE, using equations (fT3"|) and (|14[) . We estimate the posterior mean and standard 
deviation of PVE, PGE, from these samples. 

The novel part of our algorithm is a new efficient approach to evaluating the likelihood P(y\h, p, ir, 7) 
that considerably reduces the overall computational burden of the algorithm. The direct naive approach 
to evaluating this likelihood involves a matrix inversion and a matrix determinant calculation that scale 
cubically with the number of individuals n, and this cost is incurred every iteration as hyper parameter 
values change. Consequently, this approach is impractical for typical association studies with large 
n, and ad hoc approximations are commonly used to reduce the burden. For example, both |37j and 
[41j fix o~\ to some pre-estimated value. As we show later, this kind of approximation can reduce the 
accuracy of predicted phenotypes. Here, we avoid such approximations by exploiting recently developed 
computational tricks for LMMs 8,9. The key idea is to perform a single eigen-decomposition and use 
the resulting unitary matrix (consisting of all eigen vectors) to transform both phenotypes and genotypes 
to make the transformed values follow independent normal distributions. By extending these tricks 
to BSLMM we evaluate the necessary likelihoods much more efficiently. Specifically, after a single n 3 
operation at the start of the algorithm, the per iteration computational burden is linear in n (the same 
as BVSR), allowing large studies to be analyzed. 

Full details of the sampling algorithm appear in Text S2. 

URLs 

Software implementing our methods is included in the GEMMA software package, which is freely available 
at |http : / /st ephenslab . uchicago . edu/ software . html 
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Results 

Simulations: PVE Estimation 

Both the LMM and BVSR have been used to estimate the PVE [HJH5]. Since the LMM assumes that 
all SNPs have an effect, while BVSR assumes that only a small proportion of SNPs have an effect, we 
hypothesize that BVSR will perform better when the true underlying genetic structure is sparse and 
LMM will perform better when the true genetic structure is highly polygenic. Further, because BSLMM 
includes both as special cases, we hypothesize that BSLMM will perform well in either scenario. 

To test these hypotheses, we perform a simulation using real genotypes at about 300,000 SNPs in 
3,925 Australian individuals |22j . and simulate phenotypes under two different scenarios. In Scenario I we 
simulate a fixed number S of causal SNPs (with S = 10, 100, 1000, 10000), with effect sizes coming from a 
standard normal distribution. These simulations span a range of genetic architectures, from very sparse 
to highly polygenic. In Scenario II we simulate two groups of causal SNPs, the first group containing a 
small number of SNPs of moderate effect (S = 10 or S = 100), plus a second larger group of 10, 000 SNPs 
of small effect representing a "polygenic component" . This scenario might be considered more realistic, 
containing a mix of small and larger effects. For both scenarios we added normally-distributed errors to 
phenotype to create datasets with true PVE =0.6 and 0.2 (equation [T3"|) . We simulate 20 replicates in 
each case, and run the algorithms with all SNPs, including the simulated causal variants, so that the true 
PVE for typed markers is either 0.6 or 0.2 (if we excluded the causal variants then the true PVE would 
be unknown). 

Figures |TJ\ and HP, show the root of mean square error (RMSE) of the PVE estimates obtained by 
each method, and Figure |TJ3 and QJ) summarize the corresponding distributions of PVE estimates. In 
agreement with our original hypotheses, BVSR performs best (lowest RMSE) when the true model is 
sparse (e.g. Scenario I, S = 10 or S = 100 in Figures [TJ\, \Vp). However, it performs very poorly under 
all the other, more polygenic, models. This is due to a strong downward bias in its PVE estimates 
(Figures [TJ3, QJ)). Conversely, under the same scenarios, LMM is the least accurate method. This is 
because the LMM estimates have much larger variance than the other methods under these scenarios 
(Figure [T^fTf)), although, interestingly, LMM is approximately unbiased even in these settings where 
its modeling assumptions are badly wrong. As hypothesized, BSLMM is robust across a wider range of 
settings than the other methods: its performance lies between LMM and BVSR when the true model is 
sparse, and provides similar accuracy to LMM when not. 

Of course, in practice, one does not know in advance the correct genetic architecture. This makes the 
stable performance of BSLMM across a range of settings very appealing. Due to the poor performance 
of BVSR under highly polygenic models, we would not now recommend it for estimating PVE in general, 
despite its good performance when its assumptions are met. 

Simulations: Phenotype Prediction 

We also compare the three methods on their ability to predict phenotypes from genotypes, using the 
same simulations. 

To measure prediction performance, we use relative prediction gain (RPG; see Text SI Detailed 
Methods). In brief, RPG is a standardized version of mean square error: RPG=1 when accuracy is as 
good as possible given the simulation setup, and RPG=0 when accuracy is the same as simply predicting 
everyone to have the mean phenotype value. RPG can be negative if accuracy is even worse than that. 

Figure [2] compares RPG of different methods for simulations with PVE=0.6 (results for PVE=0.2 are 
qualitatively similar, not shown). Interestingly, for phenotype prediction, the relative performance of the 
methods differs from results for PVE estimation. In particular, LMM performs poorly compared with the 
other two methods in all situations, except for Scenario I with S = 10, 000, the Scenario that comes closest 
to matching the underlying assumptions of LMM. As we expect, BSLMM performs similarly to BVSR 
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in scenarios involving smaller numbers of causal SNPs (up to S = 1,000 in Scenario I), and outperforms 
it in more polygenic scenarios involving large numbers of SNPs of small effect (e.g. Scenario II). This is 
presumably due to the random effect in BSLMM that captures the polygenic component, or, equivalcntly, 
due to the mixture of two normal distributions in BSLMM that better captures the actual distribution 
of effect sizes. The same qualitative patterns hold when we redo these simulation comparisons excluding 
the causal SNPs (Figure [ST]) or use correlation instead of RPG to assess performance (Figure IS2l and [S3]) . 

For a potential explanation why LMM performs much less well for phenotype prediction than for PVE 
estimation, we note the difference between these two problems: to accurately estimate PVE it suffices 
to estimate the "average" effect size reliably, whereas accurate phenotype prediction requires accurate 
estimates of individual effect sizes. In situations where the normal assumption on effect sizes is poor, 
LMM tends to considerably underestimate the number of large effects, and overestimate the number of 
small effects. These factors can cancel one another out in PVE estimation, but both tend to reduce 
accuracy of phenotype prediction. 

Estimating PVE in Complex Human Traits 

To obtain further insights into differences between LMM, BVSR and BSLMMM, we apply all three 
methods to estimate the PVE for five traits in two human GWAS data sets. The first data set contains 
height measurements of 3,925 Australian individuals with about 300,000 typed SNPs. The second data set 
contains four blood lipid measurements, including high-density lipoprotein (HDL), low-density lipoprotein 
(LDL), total cholesterol (TC) and triglycerides (TG) from 1,868 Caucasian individuals with about 550,000 
SNPs. The narrow sense heritability for height is estimated to be 0.8 from twin-studies (22j[59]. The 
narrow sense heritabilitics for the lipid traits have been estimated, in isolated founder populations, to be 
0.63 for HDL, 0.50 for LDL, 0.37 for TG in Huttcrites [60], and 0.49 for HDL, 0.42 for LDL, 0.42 for TC 
and 0.32 for TG in Sardinians [61] . 

Table [2] shows PVE estimates from the three methods for the five traits. PVE estimates from BVSR 
are consistently much smaller than those obtained by LMM and BSLMM, which arc almost identical for 
two traits and similar for the others. Estimates of PVE from both LMM and BSLMM explain over 50% 
of the narrow sense heritability of the five traits, suggesting that a sizable proportion of heritability of 
these traits can be explained, either directly or indirectly, by available SNPs. 

These results, with LMM and BSLMM providing similar estimates of PVE, and estimates from BVSR 
being substantially lower, are consistent with simulation results for a trait with substantial polygenic com- 
ponent. One feature of BSLMM, not possessed by the other two methods, is that it can be used to attempt 
to quantify the relative contribution of a polygenic component, through estimation of PGE, which is the 
proportion of total genetic variance explained by "large" effect size SNPs (or more precisely, by the addi- 
tional effects of those SNPs above a polygenic background). Since the PGE is defined within an inevitably 
over-simplistic model, specifically that effect sizes come from a mixture of two normal distributions, and 
also because it will be influenced by unmeasured environmental factors that correlate with genetic fac- 
tors, we caution against over-interpreting the estimated values. We also note that estimates of PGE for 
these data (Table [2]) are generally not very precise (high posterior standard deviation). Nonetheless, it is 
interesting that the estimated PGE for height, at 0.12, is lower than for any of the lipid traits (ranging 
from 0.18 for TG to 0.46 for TC), and that all these estimates suggest a substantial contribution from 
small polygenic effects in all five traits. 

Predicting Disease Risk in the WTCCC Data Set 

To assess predictive performance on real data, we turn to the Wellcome trust case control consortium 
(WTCCC) 1 study [62], which have been previously used for assessing risk prediction [63H65] . These data 
include about 14,000 cases from seven common diseases and about 3,000 shared controls, typed at a total 
of about 450,000 SNPs. The seven common diseases are bipolar disorder (BD), coronary artery disease 
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(CAD), Crohn's disease (CD), hypertension (HT), rheumatoid arthritis (RA), type 1 diabetes (T1D) and 
type 2 diabetes (T2D). 

We compared the prediction performance of LMM, BVSR and BSLMM for all seven diseases. Follow- 
ing [M], we randomly split the data for each disease into a training set (80% of individuals) and a test set 
(remaining 20%), performing 20 such splits for each disease. We estimated parameters from the training 
set by treating the binary case control labels as quantitative traits, as in [5][9]. [This approach can be 
justified by recognizing the linear model as a first order Taylor approximation to a generalized linear 
model; we discuss directly modeling binary phenotypes in the Discussion section.] We assess prediction 
performance in the test set by area under the curve (AUC) [6T>] . 

Figure [3] shows AUC for the three methods on all seven diseases. As in our simulations, we find 
BSLMM performs as well as or better than either of the other two methods for all seven diseases. Indeed, 
the performance of BSLMM appears to compare favorably with previous methods applied to the same data 
set [6"3T - |6"5] (a precise comparison with previous results is difficult, as some studies use slightly different 
splitting strategics |631I65| and some do not perform full cross validation [64] ) . As might be expected from 
the simulation results, BVSR performs better than LMM in diseases where a small number of relatively 
strong associations were identified in the original study [62] (CD, RA and T1D) and worse in the others. 
We obtained qualitatively similar results when we measured performance using the Brier score instead of 
AUC (Text S3; Figure M). 

Finally, we caution that, although BSLMM performs well here relative to other methods, at the 
present time, for these diseases, its prediction accuracy is unlikely to be of practical use in human clinical 
settings. In particular, in these simulations the number of cases and controls in the test set is roughly 
equal, which represents a much easier problem than clinical settings where disease prevalence is generally 
low even for common diseases (see |64j for a relevant discussion). 

Predicting Quantitative Phenotypes in Heterogeneous Stock Mice 

In addition to the WTCCC data set, we also assess perdition performance using a mouse data set [67], 
which has been widely used to compare various phenotype prediction methods [37H39] . The mouse data 
set is substantially smaller than the human data (n ~ 1000, p ~ 10,000, with exact numbers varying 
slightly depending on the phenotype and the split). This makes it computationally feasible to compare 
with a wider range of other methods. Therefore, we include in our comparisons here five other related 
approaches, some of which have been proposed previously for phenotype prediction. Specifically we 
compare with: 

1. LMM-Bayes, a Bayesian version of the LMM, which we fit by setting p = in BSLMM using our 
software. 

2. Bayesian Lasso [39,68), implemented in the R package BLR |39j . 

3. BayesA-Flex, our own modification of BayesA, which assumes a t distribution for the effect sizes. 
Our modification involves estimating the scale parameter associated with the t± distribution from 
the data (Text SI Detailed Methods). Although the original BayesA performs poorly in this data 
set [38], this simple modification greatly improves its prediction performance (emphasizing again 
the general importance of estimating hyper-parameters from the data rather than fixing them to 
arbitrary values) . We modified the R package BLR [33] to obtain posterior samples from this model. 

4. BayesC7r, implemented in the online software GenSel [M]. This implementation does not allow 
random effects, and therefore uses the same model as our BVSR, although with different prior 
distributions. 
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5. BSLMM-EB (where EB stands for empirical Baye^l), an approximation method to fit BSLMM. 
The method fixes the variance component a\ to its REML (restricted maximum likelihood) estimate 
obtained with (3 = 0, which is one of several strategies used in previous studies to alleviate the 
computational burden of fitting models similar to BSLMM [37ll41j . We sample posteriors from this 
model using our software. 

See Text SI Detailed Methods for further details. 

Following previous studies that have used these data for prediction [37rf3"9] we focused on three 
quantitative phenotypes: CD8, MCH and BMI. These phenotypes have very different estimated narrow 
sense heritabilities: 0.89, 0.48, and 0.13 respectively [69]. Table [Si] lists estimates of some key quantities 
for the three traits - including PVE, PGE and log 10 (7r) - obtained from LMM, BVSR and BSLMM. 
All three methods agree well on the PVE estimates, suggesting that the data is informative enough to 
overwhelm differences in prior specification for PVE estimation. 

Following [371(32 ; we divide the mouse data set roughly half and half into a training set and a test 
set. As the mice come from 85 families, and individuals within a family are more closely related than 
individuals from different families, we also follow previous studies and use two different splits of the data: 
the intra-family split mixes all individuals together and randomly divides them into two sets of roughly 
equal size; the inter-family split randomly divides the 85 families into two sets, where each set contains 
roughly half of the individuals. We perform 20 replicates for each split of each phenotype. It is important 
to note that the intra-family split represents an easier setting for phenotype prediction, not only because 
individuals in the test set are more related genetically to those in the training set, but also because the 
individuals in the test set often share a similar environment with those in the training set (specifically, in 
the intra-family split, many individuals in the test set share a cage with individuals in the training set, 
but this is not the case in the inter- family split). 

We apply each method using genotypes only, without other covariates. We obtain effect size estimates 
in the training data set, and assess prediction performance using these estimates in the test set by root 
of mean square error (RMSE), where the mean is across individuals in the test set. We contrast the 
performance of other methods to BSLMM by calculating the RMSE difference, where a positive number 
indicates worse performance than BSLMM. We perform 20 inter-family splits and 20 intra-family splits 
for each phenotype. 

Figure HI summarizes the prediction accuracy, measured by RMSE, of each method compared against 
BSLMM. Measuring prediction performance by correlation gives similar results (Figure [S5j). For the low- 
heritability trait BMI, where no measured SNP has a large effect, all methods perform equally poorly. For 
both the more heritable traits, CD8 and MCH, BSLMM consistently outperformed all other methods, 
which seem to split into two groups: LMM, LMM-Bayes and Bayesian Lasso perform least well and 
similarly to one another on average; BVSR, BaycsA-Flcx, BayesC7r and BSLMM-EB perform better, and 
similarly to one another on average. A general trend here is that accuracy tends to increase as model 
assumptions improve in their ability to capture both larger genetic effects, and the combined "polygenic" 
contribution of smaller genetic effects (and possibly also confounding environmental effects correlating 
with genetic background). In particular, the ti distribution underlying BaycsA-Flcx, which has a long tail 
that could capture large effects, performs noticeably better than either the normal or double-exponential 
distributions for effect sizes underlying LMM and Bayesian Lasso. 

Comparisons of pairs of closely-related methods yield additional insights into factors that do and 
do not affect prediction accuracy. The fact that BSLMM performs better than BSLMM-EB illustrates 
how the approximation used in BSLMM-EB can degrade prediction accuracy, and thus demonstrates the 
practical benefits of our novel computational approach that avoids this approximation. Similarly, the 
superior performance of BaycsA-Flex over BayesA (which performed poorly; not shown) also illustrates 
the benefits of estimating hyper parameters from the data, rather than fixing them to pre-specified values. 

lr This is a slight abuse of terminology, since in this method the estimated hyper parameters are obtained under the model 
with 3 = 
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The similar performance between BVSR and BayesC7r, which fit the same model but with different priors, 
suggests that, for these data, results are relatively robust to prior specification. Presumably, this is 
because the data are sufficiently informative to overwhelm the differences in prior. 

Computational Speed 

The average computational time taken for each method on the Mouse data is shown in Table [3] Some 
differences in computational time among methods may reflect implementational issues, including the lan- 
guage environment in which the methods are implemented, rather than fundamental differences between 
algorithms. In addition, computing times for many methods will be affected by the number of iterations 
used, and we did not undertake a comprehensive evaluation of how many iterations suffice for each algo- 
rithm. Nonetheless, the results indicate that our implementation of BSLMM is competitive in computing 
speed with the other (sampling-based) implementations considered here. 

In particular, we note that BSLMM is computationally faster than BVSR. This is unexpected, since 
BSLMM is effectively BVSR plus a random effects term, and the addition of a random effects term usually 
complicates computation. The explanation for this is that the (pcr-iteration) computational complexity 
of both BSLMM and BVSR depends, quadratically, on the number of selected SNPs in the sparse effects 
term (I7Q, and this number can be substantially larger with BVSR than with BSLMM, because with 
BVSR additional SNPs are included to mimic the effect of the random effects term in BSLMM. The size 
of this effect will vary among data sets, but it can be substantial, particularly in cases where there are a 
large number of causal SNPs with small effects. 

To illustrate this, Table |4] compares mean computation time for BSLMM vs BVSR for all data sets 
used here. For simulated data with a small number of causal SNPs, BSLMM and BVSR have similar 
computational times. However, in other cases (e.g. PVE=0.6, S=10,000 in Scenario I) BSLMM can be 
over an order of magnitude faster than BVSR. In a sense, this speed improvement of BSLMM over BVSR 
is consistent with its hybrid nature: in highly polygenic traits, BSLMM tends to behave similarly to 
LMM, resulting in a considerable speed gain. 
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Discussion 

We have presented novel statistical and computational methods for BSLMM, a hybrid approach for 
polygenic modeling for GWAS data that simultaneously allows for both a small number of individually 
large genetic effects, and combined effects of many small genetic effects, with the balance between the 
two being inferred from the data in hand. This hybrid approach is both computationally tractable for 
moderately large data sets (our implementation can handle at least 10,000 individuals with 500,000 SNPs 
on our moderately-equipped modern desktop computer), and is sufficiently flexible to perform well in 
a wide range of settings. In particular, depending on the genetic architecture, BSLMM is either as 
accurate, or more accurate, than the widely-used LMM for estimating PVE of quantitative traits. And 
for phenotype prediction BSLMM consistently outperformed a range of other approaches on the examples 
we considered here. By generalizing two widely-used models, and including both as special cases, BSLMM 
should have many applications beyond polygenic modeling. Indeed, despite its increased computational 
burden, we believe that BSLMM represents an attractive alternative to the widely-used LASSO [70] for 
general regression-based prediction problems. 

Although it was not our focus here, BSLMM can be easily modified to analyze binary phenotypes, 
including for example, a typical human case-control GWAS. For PVE estimation, one can directly apply 
BSLMM, treating the 1/0 case-control status as a quantitative outcome, and then apply a correction 
factor derived by [24] to transform this estimated PVE on the "observed scale" to an estimated PVE 
on a latent liability scale. This correction, for which we supply an alternative derivation in Text S3, 
corrects for both ascertainment and the binary nature of case-control data. For phenotype prediction, 
one can again directly apply BSLMM, treating the 1/0 case-control status as a quantitative outcome, and 
interpret the resulting phenotype predictions as the estimated probability of being a case. Although in 
principle one might hope to improve on this by modifying BSLMM to directly model the binary outcomes, 
using a probit link for example, we have implemented this probit approach and found that not only is 
it substantially more computationally expensive (quadratic in n instead of linear in n) , but it performed 
slightly worse than treating the binary outcomes as quantitative, at least in experiments based on the 
mouse phenotypes considered here (Text S3). This may partly reflect inaccuracies introduced due to the 
greater computational burden. 

The computational innovations we introduce here, building on work by [3|9], make BSLMM con- 
siderably more tractable than it would otherwise be. Nonetheless, the computational burden, as with 
other posterior sampling based methods, remains heavy, both due to memory requirements (e.g. to store 
all genotypes) and CPU time (e.g. for the large number of sampling iterations required for reasonable 
convergence). Although more substantial computational resources will somewhat increase the size of data 
that can be tackled, further methodological innovation will likely be required to apply BSLMM to the 
very large data sets that are currently being collected. 

In addition to providing a specific implementation that allows BSLMM to be fitted to moderately 
large data sets, we hope that our work also helps highlight some more general principles for improving 
polygenic modeling methodology. These include: 

1. The benefits of characterizing different methods by their effect size distribution assumptions. While 
this point may seem obvious, and is certainly not new (e.g. |40II71| ). polygenic modeling applications 
often focus on the algorithm used to fit the model, rather than the effect size distribution used. 
While computational issues are very important, and often interact with modeling assumptions, we 
believe it is important to distinguish, conceptually, between the two. One benefit of characterizing 
methods by their modeling assumptions is that it becomes easier to predict which methods will 
tend to do well in which settings. 

2. The importance of selecting a sufficiently flexible distribution for effect sizes. The version of BSLMM 
we focus on here (with K = ^XX T ) assumes a mixture of two (zero- mean) normals for the effect 
size distribution. Our examples demonstrate the gain in performance this achieves compared to less 
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flexible distributions such as a single normal (LMM) or a point- normal (BVSR). More generally, in 
our phenotypc prediction experiments, methods with more flexible effect size distributions tended 
to perform better than those with less flexible distributions. 

3. The importance of estimating hyper-parameters from data, rather than fixing them to pre-specified 
values. Here we are echo-ing and reinforcing similar themes emphasized by [53] and [T5]. Indeed, 
our comparison between BSLMM and BSLMM-EB for phenotype prediction illustrates the benefits 
not only of estimating hyper-parameters from the data, but of doing so in an integrated way, rather 
than as a two-step procedure. 

4. The attractiveness of specifying prior distributions for hyper-parameters by focusing on the pro- 
portion of variance in phenotypes explained by different genetic components (e.g. PVE and PGE 
in our notation). This idea is not limited to BSLMM, and could be helpful even with methods that 
make use of other effect size distributions. 

One question to which we do not know the answer is how often the mixture of two normal distributions 
underlying BSLMM will be sufficiently flexible to capture the actual effect size distribution, and to what 
extent more flexible distributional assumptions (e.g. a mixture of more than two normals, or a mixture 
of t distributions with degrees of freedom estimated from the data) will produce meaningful gains in 
performance. It seems likely that, at least in some cases, use of a more flexible distribution will improve 
performance, and would therefore be preferable if it could be accomplished with reasonable computational 
expense. Unfortunately some of the tricks we use to accomplish computational gains here may be less 
effective, or difficult to apply, for more flexible distributions. In particular, the tricks we use from [5] 
and [3] may be difficult to extend to allow for mixtures with more than two components. In addition, for 
some choices of effect size distribution, one might have to perform MCMC sampling on the effect sizes /3 
directly, rather than sampling 7, integrating out analytically, as we do here. It is unclear whether this 
will necessarily result in a loss of computational efficiency: sampling (3 reduces computational expense 
per update at the cost of increasing the number of updates necessary (sampling 7 by integrating over 
/3 analytically ensures faster mixing and convergence [Z21E3]). Because of these issues, it is difficult to 
predict which effect size distributions will ultimately provide the best balance between modeling accuracy 
and computational burden. Nonetheless, compared with currently available alternatives, we believe that 
BSLMM strikes an appealing balance between flexibility, performance, and computational tractability. 
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Text SI Detailed Methods 

GWAS Datasets 

We used four GWAS data sets in the present study. 

The first data set contains height measurements for 3925 Australian individuals [35]. We used the 
data set for simulation and PVE estimation. The phenotypes were already regressed out of age and sex 
effects, and were quantile normalized to a standard normal distribution afterwards [22]. A total of 294,831 
SNPs were available after stringent quality control [22] . We imputed missing SNPs using IMPUTE2 [74] , 
and further excluded five SNPs that have minor allele frequencies below 1% after imputation. 

The second data set consists of blood lipid measurements for 1868 individuals [75]. We used this data 
for PVE estimation. The individuals came from two study groups: the Cholesterol and Pharmacoge- 
netics (CAP) group [76] and the Pravastatin Inflammation/CRP Evaluation (PRINCE) group [77] ■ Trie 
PRINCE study consists of two cohorts: one contains individuals with history of cardiovascular diseases 
(CVD) and the other contains individuals with no history of CVD. From both study groups, we selected 
all 1868 individuals that have complete low-density lipoprotein (LDL) subfraction measurements. We 
selected four different blood lipid measurements as phenotypes in the present study: LDL, high-density 
lipoprotein (HDL), total cholesterol (TC) and triglycerides (TG). Phenotypes were quantile-normalized 
to a standard normal distribution within each group, corrected for covariates including BMI (body mass 
index), age, sex, and smoking status effects, and quantile-normalized again [75] . Individuals were typed 
on two different SNP arrays (Illumina HumanHap300 and HumanQuad610 bead chips, Illumina, San 
Diego, CA). We used all SNPs that appeared in either of the arrays and we imputed missing genotypes 
using IMPUTE2 [7J. We obtained a total of 582,962 SNPs and we used 555,601 polymorphic SNPs with 
minor allele frequency above 1% for analysis. 

The third data set is from the Wellcome trust case control consortium (WTCCC) 1 study [62]. We 
used this data set to assess phenotype prediction performance. The data set consists of about 14,000 
cases of seven common diseases, including 1868 cases of bipolar disorder (BD), 1926 cases of coronary 
artery disease (CAD), 1748 cases of Crohn's disease (CD), 1952 cases of hypertension (HT), 1860 cases 
rheumatoid arthritis (RA), 1963 cases of type 1 diabetes (T1D) and 1924 cases of type 2 diabetes (T2D), 
as well as 2938 shared controls. Wc obtained quality controlled genotypes from WTCCC and we further 
imputed missing genotypes using BIMBAM [78], which resulted in a total of 458,868 shared SNPs. All 
polymorphic SNPs with minor allele frequency above 1% in the training data were used for prediction 
(about 400,000 SNPs; depending on the disease and the split). 

The fourth data set comes from a genetically heterogeneous stock of mice, consisting of 1904 indi- 
viduals from 85 families, all descended from eight inbred progenitor strains [67]. We used this data set 
to assess phenotype prediction performance of several methods. Multiple phenotype measurements are 
available for the data set, and we selected three phenotypes among them: percentage of CD8+ cells (CD8, 
n = 1410), mean corpuscular hemoglobin (MCH, n = 1580) and body mass index (BMI, n = 1828). We 
selected these phenotypes because they were previously used for comparing prediction performance of 
various methods [37H39] . and they represent a wide range of narrow sense heritability: CD8 has a high 
hcritability, MCH has a median heritability and BMI has a low heritability [69]. All phenotypes were 
already corrected for sex, age, body weight, season and year effects [67], and we further quantile normal- 
ized the phenotypes to a standard normal distribution. A total of 12,226 autosomal SNPs were available 
for all mice. For individuals with missing genotypes, we imputed missing values by the mean genotype 
of that SNP in their family. All polymorphic SNPs with minor allele frequency above 1% in the training 
data were used for prediction (about 10,000 SNPs; depending on the phenotype and the split). 
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Simulations 

We used genotypes from the human height data set [22] described above and simulated phenotypes from 
the simple linear model (UJ with different assumptions for the distribution of effect sizes 3. We consider 
two simulation scenarios where the true PVE is known and we simulated 20 independent sets of phenotype 
data in each case. 

Scenario I: the effect sizes of causal SNPs come from a normal distribution. We randomly chose a fixed 
number of causal SNPs (10, 100, 1000, 10000) and simulated their effect sizes from a N(0, 1) distribution. 
We drew the errors from a normal distribution with variance chosen to achieve a given PVE (0.2 and 
0.6). 

Scenario II: the effect sizes of causal SNPs come from a mixture of two normal distributions, such 
that a small group of causal SNPs have additional effects. We first randomly chose a large number of 
causal SNPs (10000), and among them, we further selected a small number of medium effect size SNPs 
(10 or 100) and used what were left as small effect size SNPs (9990 or 9900). We simulated the small 
effect sizes for all causal SNPs (10000) from a N(0, 1) distribution. Afterwards, we drew additional effect 
sizes (in addition to the small effects already drawn) for those medium effect SNPs (10 or 100) from 
a N(0, 1) distribution, and scaled these additional effect sizes further so that together they explained 
a fixed proportion of genetic variance, or PGE (0.1 and 0.2, for 10 and 100 medium effect size SNPs, 
respectively). Once we obtained the final effect sizes for all causal SNPs, we drew errors to achieve a 
given PVE (0.2 and 0.6). 

Assessing Prediction Performance in Simulations 
MSPE and RPG 

We assess prediction accuracy mainly using mean square prediction error (MSPE), and a rescaled version 
of MSPE called relative predictive gain (RPG). The MSPE for predicting a future observation in the 
simple linear model (P) depends on comparing an estimated value of 3, 0, the true value of 3, and the 
error variance r, as follows: 

MSPE03;/3,r) :=E(x. T p-y) 2 (24) 
p 

= £(£>*(&- A)) 2 ) +r" J (25) 

= E E r v & - - ft ) + r_1 ( 26 ) 

t=l 3=1 

« E ft) +T" 1 , (27) 

|i-3|<20 

where y is the phenotype for a future observation, x is the corresponding p-vector of genotypes, = 
E{xiXj) is the covariance between SNP i and j. In practice, we approximate r.y with the sample co- 
variance Sij = — XikXjk, and we only consider Sy- for neighboring SNPs that satisfy \i — j\ < 20. 
This is because linkage dis-cquilibrium (LD) decays with distance and remote SNPs are approximately 
independent with each other. The above definition of MSPE extends the definition in [19] to take into 
account correlations among neighboring SNPs. 

We denote MSPEo as the MSPE obtained using only the mean of the phenotype (i.e. y) for prediction, 
and we define RPG as the rescaled version of MSPE following [T9"] : 

RPCffl 3) - MSPEo -MSPE(fr/3,r) 

RPG(/3,/3) - MSPE -MSPE(/3; AT)' (28) 
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When applying these formulae for LMM or BSLMM note that we use estimates for in (JT|), and 
not for in ^ . This ensures that the resulting predictions take account of both the sparse effects and 
the random effect u in ([6|) . The way we obtain these estimates is detailed below. 



Correlation 

We also assess prediction accuracy using correlation. The correlation between the predicted value and 
the true value for a future observation in the simple linear model (JXJ) depends on 0, and r as follows: 

Cor(0;0,r):=CoT(x T 0,y) (29) 

mEU*MEU*3fr)) (30) 

(31) 
(32) 



£((Ei=i *i&) 2 )£((ELi ^-) 2 )/PVE(/3, r) 



-i|<20 Si J 
|<20 Si i |<20 Si i 



Again, when applying these formulae for LMM or BSLMM note that we use estimates for in (|T|), 
and not for in ©. 



Details of BSLMM 
Centering X and K 

We assume that the genotypes in X have been measured on bi-allelic markers, and that the genotypes 
at each marker are coded as 0, 1 or 2 copies of some reference allele. (For imputed genotypes we use the 
posterior mean genotype [75].) It occasionally simplifies the algebra to assume that each column of X 
is centered to have mean 0; since the results will be the same with or without centering, we make this 
assumption throughout. It is also common to standardize the columns of X to have unit variance. In 
contrast to centering, standardizing the columns will affect the results, and we do not standardize the 
columns in our applications here, although all our methods could be applied with the matrix standardized 
in this way. (Standardizing the columns of X corresponds to making an assumption that rarer variants 
tend to have larger effects than common variants, and precisely that marker effect sizes tend to decay with 
the inverse of the genotype variance; see [79j[80] for relevant discussion.) In summary, throughout this 
paper = (x^j — Xj ) where x\j is the number of copies of the reference allele at marker j in individual 
i, and xj := (1/n) Xij . 

To facilitate prior specification, in addition to centering the genotype matrix X, we also assume 
that the relatedness matrix K is "centered" , in the sense that the random effects have mean zero: 
E™=i Ui = 0- This holds automatically for K oc XX T , with X centered. More generally it can be 
achieved by multiplying the relatedness matrix with a projection matrix on both sides: MKM, where 
M = I„ — 1„(1^1„) _1 1^. The resulting transformed relatedness matrix is positive-semidefinite as long 
as the original relatedness matrix is positive-semidefinite. 
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Definition and derivation of expressions for h and p 

We define h, p 

h{wb):= ^W + u)) (33) 
ECVCXP)) 

p(n,a a ,<7 b ):= v - H " , (34) 

rV ; £(V(X/3 + u)) 

where the function V(x) is defined in equation IT51 and the expectations are taken with respect to (p, u), 
conditional on hyper parameters (a a , a b , 7r, t). These conditional expectations are extensions of, and slight 
simplifications of, the similar expression for h in |19j : the simplification comes from taking expectations 
conditional on 7r instead of conditional on f. These definitions of h and p are motivated by approximating 
the expectations of PVE and PGE by the ratios of the expectations of the numerator and denominator. 
Both h and p take values between and 1 and serve as rough guides to the expectations of PVE and 
PGE, respectively. 

The expectations in the above expressions, conditional on hyper-parameters (cr 2 , cr 2 , tt, t), can be 
obtained as: 

p 

E(V(Xp)\a 2 a , tt, t~ v ) = E(£ Vixifalo*,*, r" 1 ) = p7nt a o*T-\ (35) 

»=i 
p 

E(V(X^ + u)|a2, CT ^7r,r- 1 )=E(^V(x J /30+V(u)| CT ^a^^,T- 1 )=p^ aCTa 2 r- 1 + Sh( 7 h 2 r- 1 , (36) 



where x 4 is the ith column of X, s a = i Yh=i Z)"=i ^iji s & = x »j and % are tne *J tn 

elements of matrices X and K, respectively. The above derivation assumes centered genotypes and 
rclatcdncss matrix. 

Plugging these approximations into the expressions (fT"3")l and (ITU) gives ([To]) and (jTTJ) . 



Induced Priors on cr 2 and cr 2 



Solving (fTo]) and (JTTJ) for cr 2 and a 2 as functions of /i, p and 7r gives: 



2 hp 

^ ~ (l-h)pns a > (37) 



2_ fe(l-p) 



= n n • (38) 

(1 - /l)Sb 

The independent priors (|18p . (fT"9|). (JT2]) for (ft., p, n) induce a joint prior on (<t , erf,, 7r): 

^2 ^ ^ P s a s b (39) 

which is heavy tailed for (cr 2 , cr 2 ) marginally (i.e. tail has a polynomial decay), a feature desirable in 
association studies |80| . The priors also have another nice property that the marker effect size variance 
cr 2 tends to decrease as the proportion of markers with an effect (tt) increases. 

PVE Estimation with LMM, BVSR and BSLMM 

We could estimate PVE using the posterior mean of the MCMC samples for all three models, and we do 
so for both BVSR and BSLMM. For LMM, we follow previous studies [22] and use an approximation to 
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the PVE. In particular, wc consider the LMM defined by equations ©-(O with = 0, and we estimate 
PVE by the ratio of expectations 

PVE = !f fc , (40) 

where b\ is the REML estimate for the variance component. This formula can be viewed as a general- 
ization of the form used in [22] . and is valid for any choice of centered relatcdncss matrix K. 

To obtain the standard error of the above estimate, we compute the second derivative of the log 
restricted likelihood function with respect to <j\ [9] and evaluate it at a\: 

r \ b ) ~ -Q2^2\*t=*i ~ 2 tiaC °( PKPK ^ 2 y^P^ ltT '=^' ^ ' 

where l r denotes the log restricted likelihood, H = o^K + I„ and P = H _1 -H _1 l„(l n H _1 l^) _1 l^H _1 . 
Despite its complicated form, the second derivative can be easily and efficiently evaluated using recursions 
in [9]. As the variance of a\ is asymptotically v(ct^) = —l/l"(a%), using the delta method, we approximate 
the standard error of PVE by 



se(PVE) - J-^^l). (42, 

As a check on correctness, we also used the posterior mean for PVE obtained from LMM-Bayes to 
estimate PVE. This gave almost identical results in all cases considered here and therefore only results 
from LMM are presented. 

Phenotype Prediction with LMM, BVSR and BSLMM 
Real Data 

In the real data, we can perform prediction by estimating both the sparse effects (//, (3) and the random 
effects (u) in (J6j) from the training set, and use these to predict phenotypes in the test set. 

Let /t D , (3 a and u D denote estimates for the sparse and random effects obtained from the observed 
(training) sample. For BSLMM and BVSR these estimates are the posterior means for these parameters, 
estimated from MCMC samples (for BVSR u Q = 0). For LMM u Q is obtained as the conditional posterior 
mean of u Q given the REML estimate for a\ (i.e. BLUP). 

We then obtain predictions for a future (test) sample as follows. For a general relatedness matrix 
K, we assume that the random effects for the observed and future samples follow a multivariate normal 
distribution 

;-)~MVN„. + „,((2),(^ (43, 

where n and n f are the sample size for observed (training) and future (test) data, respectively. Standard 
multivariate normal theory gives the conditional distribution 

u/K ~ MVN n/ (K/oK^u,,, Kff - K/oK^K,,/). (44) 

We use the conditional mean as an estimate for u/ and thus the predicted phenotypes for future 
observations are 

y f = l n pL + X//3 Q + K f0 K-^u . (45) 
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Simulated Data 

In the simulated data we use RPG and correlation to assess prediction accuracy. To compute RPG and 
correlation we need to obtain estimates for /3 in the simple linear model (TTJ). To do this, in the special 
case when the relatedness matrix K = iXX T , we rewrite the model (|6]) as 



y = l n p + X(3 + Xa + e, (46) 

e-MVN^O.r- 1 ^), (47) 

ft ~ 7rN(0, er^T -1 ) + (1 — 7r)^o, (48) 

a^N(0,a b 2 /(pr)), (49) 



where we can think of the p- vector a as representing the "small" effect sizes present at every locus. The 
special case of tt = ((3 = 0) gives LMM, and the special case of <r 2 = (a = 0) gives BVSR. Note that 
a + (3 = (3, so summing estimates of a and (3 yields an estimate for (3 in (fT). 
For LMM we estimate a. by its conditional expectation 

A 2 

d=^X T (^K + I„)- 1 y J (50) 
P 

where a\ is the REML estimate of the variance component in the observed data. Since = in LMM, 
this estimate for a provides the required estimate for (3 in ([1}. 
For BVSR, we use the posterior mean of (3 (since a. = 0). 

For BSLMM, we use Rao-Blackwellisation to obtain an approximation to the posterior mean for a. 
(Text S2), and then add this to the (approximate) posterior mean for (3 obtained from the MCMC sampler 
to obtain an approximation for the posterior mean of f3 in ([T]). 

Other Methods 

1. LMM: We fit LMM using the GEMMA algorithm [9]. 

2. BVSR: Wc fit BVSR by fixing p = 1 in BSLMM using our software. This gives slightly better 
results, and is faster than the BVSR software piMASS (version 0.90) [T5], in all examples considered 
here. 

3. LMM-Bayes: Wc fit this by fixing p — in BSLMM using our software. 

4. Bayesian Lasso: This [S5] assumes a double-exponential prior for each coefficient ft in (TTJ): 

/3, |A ~ DE(0, A -1 ), A 2 - Gamma(Ki,K 2 ), t" 1 - IG(k 3 , k a ), (51) 

where DE denotes the double exponential (Laplace) distribution with mean and scale parameter 
A , and Gamma denotes a Gamma distribution with shape and rate parameters. We set k± = 
0.55, K2 = 10~ 6 ,K3 = 1/2 and = 1/2 following previous studies [23|39]. We use a conjugate 
Gamma prior for A 2 as in [55] instead of a Beta prior for A/100 as in [35]. We used the R package 
BLR [3j2] to sample from the posterior distribution of (3. 

5. BayesA-Flex: This assumes a scaled t-distribution for each coefficient ft in (JlJ: 

ft|cr -t(0,^,cr 2 ), cr 2 ~ IG(ki,k 2 ), t~ X - IG(k 3) « 4 ), (52) 

where IG stands for the inverse gamma distribution. Following previous studies, we set the degree 
of freedom parameter v to 4 [27,34,40] and set K3 = 1/2 and K4 = 1/2 [3pJ. We also consider 
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the posterior distribution where ki — > and K2 — > 0. The above model is similar to BayesA [27] , 
but with a key difference in the way the scaling parameter a 2 is treated: BayesA fixes a 1 to some 
pre-specified value, whereas here we specify a prior for a 2 and allow it to be estimated from the data 
(and hence the name "BayesA-Flex" ) . Using BayesA in this data set gives poor results [35] (and 
data not shown); but estimating the scaling parameter greatly improves prediction performance. 
We modified the R package BLR [39] to obtain posterior samples from this model. The modified 
code is freely available online. 

6. BayesC7r: we fit this using the online software GenSel [34] . 

7. BSLMM-EB: we fit this using our BSLMM software, fixing a\ to the REML estimate a\ from the 
null model (i.e. LMM). This approximation avoids updating a\ in each iteration of the MCMC, and 
is one of the several approximation strategies used by previous studies to alleviate the computation 
burden of models similar to BSLMM [37,41 . Intuitively, by fixing the variance component to its 
null estimate, BSLMM-EB discourages the inclusion of large effect SNPs into the sparse effects 
term and risks underestimating their effect sizes. Therefore, this approximation may reduce the 
prediction performance of BSLMM, especially when there are large effect SNPs. We confirm this 
in the real data set. 

For all MCMC based methods except for BayesC7r, we run 2.1 million iterations with the first 0.1 million 
iterations as burn- in steps. For BayesC7r, due to web server restriction, we run 1.1 million iterations with 
the first 0.1 million iterations as burn-in. 
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Text S2 Detailed MCMC Strategy for BSLMM 

To simplify notation, we assume in this section that y is centered. We use Markov chain Monte Carlo 
to obtain posterior samples of (h, p, 7r, *y) on the product space (0, 1) x (0, 1) x (0, 1) x {0, 1} P , which is 
given by 

P(h, p, ir, 7 |y) cx P(y\h, p, n, y)P(h)P(p)P(~f\ir)P(Tr). (53) 

In the above equation, we explored the fact that the parameters /3, u and r can be integrated out 
analytically to compute the marginal likelihood P(y\h, p, ir, 7). The marginal likelihood is 

P(y|fc,p,7r,7) oc \H\-i\a- 2 n\?(y T Py)-%, (54) 

where H(a 2 ) = <7 2 K+I„, ft(^ 6 2 ,7) = (X^H-^+a- 2 ^,)- 1 , P(a 2 ,a 2 , 7 ) = H 1 H ^X^X^H 1 
Notice again that <r 2 is a function of h, p and ir, while <r 2 is a function of h and p. 

To efficiently evaluate the marginal likelihood, we perform an eigen decomposition of the relatedness 
matrix K = UDU T at the beginning of the MCMC, where U is the matrix of eigen vectors and D is a 
diagonal matrix of eigen values. We transform both the phenotype vector and the genotype matrix by 
multiplying the eigen matrix and calculate U T y and U T X. Afterwards, as has been shown previously, 
the calculations of the determinant and the inverse of matrix H, as well as the vector-matrix-vector form 
y T Py, in each iteration of the MCMC, arc easy to perform [HE]. 

We use a standard Metropolis-Hastings algorithm to draw posterior samples of the hyper-parameters 
(h, p, 7r,7) based on the above marginal likelihood. Following [19], we use a rank based proposal dis- 
tribution for 7, and use random walk proposals based on uniform distributions for h, p and log(-7r). In 
particular, we first obtain single-SNP p values using a standard LMM with GEMMA algorithm [9], and 
then rank SNPs based on these p values from small to large. Our aim is to use a proposal distribution 
for 7 that puts more weights on SNPs that arc ranked higher by the single SNP tests, and to do this we 
consider a mixture distribution Q p = 0.3C/ p + 0.7G P , where U p is a uniform distribution on 1, ■ • • ,p and 
G p is a geometric distribution truncated to 1, ■ • • ,p with its parameter chosen to give a mean of 2000. 
We denote 7 + = {i : 7^ = 1} and we propose the new 7 by randomly choose one of the following steps: 

• add a covariate with probability 0.4: generate r from Q p until the covariate with rank r is not in 
7 + , then add this covariate to 7 + 

• remove a covariate with probability 0.4: pick a covariate in 7 + uniformly at random and remove it 
from 7 + 

• switch a pair of covariates with probability 0.2: pick up two covariates by the above two steps, and 
switch their indicator values 

For the other hyper-parameters, we update log(7r) by adding a random variable from U(-0.05, 0.05) 
to the current value, and update h and p by adding a random variable from U(-0.1, 0.1) to the current 
values. New values of h and p that lie outside the boundary [0,1] arc reflected back. 

In addition to the above local proposal distributions, we also use a "small world proposal" which 
improves theoretical MCMC convergence (5TJ. In brief, with probability 0.33 in each iteration, we make a 
longer-range proposal by compounding many local moves, where the number of compounded local moves 
is draw uniformly from 1 to 20. 

For each sampled values of (h, p, tt, 7), we further obtain samples of r and (3 using the conditional 
distributions P(r\y, h, p, 7r, 7) and P((3\y, h, p, 7r, 7, r) listed below: 

71 y^Py 

T|y,/i,P,7r,7 ~ Gamma(-, — - — ), (55) 

3 7 |y, h, p, tt, 7, r ~ MVN| 7 , (nX^H-V, r-'n), (56) 

/3_ 7 |y,/i,P,7r,7,i- ~ S . (57) 
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Afterwards, we sample u based the conditional distribution P(u|y, h, p, tt, 7, r, (3): 

u|y, h, R, 7r,7, t, ~ MVNn^KH" 1 ^ - X 7 ^ 7 ), of KH^r" 1 ). (58) 

However, we do not sample u directly from the above n-dimensional multivariate normal distribution. 
Instead, we sample U T u (and we never need to obtain u), as the conditional distribution of each element 
in U T u is a normal: 

U T u|y, h, p, tt, 7, r, p ~ MVN n (ofD(o 6 2 D + IJ-^y - U T X 7 ^ 7 ), a 2 D(<7 2 D + I)" 1 ^ 1 ). (59) 

where the covariance matrix is diagonal. 

For each sampled value of (J3, u, r), we obtain samples of PVE and PGE based on equations (Q~3|) 
and dHJ). 

When required (e.g. for evaluating RPG in simulation studies), in the special case K = XX T /p, we 
also obtain the (approximate) posterior mean of a. in the alternative model formulation (|46|) - (j49|) . This 
is achieved without sampling a in each iteration using the fact that the full conditional distribution of 
a given other sampled values is 

a|y, h, p, tt, 7, r, ~ MVN„(a b V 1 X T H- 1 (y - X 7( 3 7 ), a 2 b (p-% - p- 2 ( r 2 X T H- 1 X)r- 1 ), (60) 

which leads to the Rao-Blackwellised approximation for the posterior mean of a: 

d^EEColy.tfVV^V^ (61) 
i=l ^ t=l 

where T is the total number of MCMC iterations, and the superscript (t) denotes the tth MCMC sample. 
Notice that we only need to do the p dimensional matrix- vector multiplication once at the end. 

When I7I is large, the most time consuming part of our MCMC scheme for fitting BSLMM and BVSR 
is the calculation of fi. The per- iteration computation time of the above algorithm is comparable to that 
of BVSR j!9j with linear complexity in the number of individuals but quadratic complexity in | — y | _ In 
practice, to reduce the computation burden, we set a maximal value for | — y | (300 for simulations and the 
two human data sets, 600 for the mouse data set). Setting the maximal value to a larger number (600) 
in simulations improves results only subtly, even for scenarios where a large number of causal SNPs is 
present. 
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Text S3 The Probit BSLMM and Binary Traits 

MCMC strategy 

We use "probit BSLMM" to refer to a BSLMM with a probit link to model binary traits: 

P(y t = l|xi, 0, Ui) = 1 - P{y t = O\xi,0, u t ) = $(jt + x iy 3 + Ui ) (i = 1, • • • , n), (62) 

where y t is the binary trait for ith individual, jq is the ith row-vector of X, it, is ith element of random 
effects vector u and $ is the cumulative distribution function (CDF) of the standard normal distribution. 
Following [55], we introduce a vector of auxiliary variables z and obtain the equivalent latent variable 
model as: 

/ 1 if Zi > . . 

yt = \ ifz,<0 ' (63) 

Zi = ii + Xi/3 + m + ei ei~N(0,l), (64) 

where Zi is ith element of vector z. 

We use the same prior specifications for the hyper-parameters as described in the main text (except 
that t = 1 here). We use a similar MCMC strategy as described in Text S2 to sample posteriors, with 
an additional step to sample the posteriors of the latent variables z using the conditional distribution 
P(z|y, 7 ,Au): 

z i\lJi — L 0, Ui ~ N(/i + x 7 j/3 + Ui, 1) left truncated at 0, (65) 
Zi\yi = 0, 0, Ui ~ N(/i + x*yi0 + Ui, 1) right truncated at 0. (66) 

We denote z as the sample mean of z, or z = - J27=i z i- Conditional on the latent variables z, the posterior 
sampling for the hyper-parameters {h,p, 7r, *y) is based on the marginal likelihood P{h, p, it, ~y\z), which 
is slightly different from that in Text S2 as we do not integrate out r here: 

P(z\h,p,ir,~f) oc |H|-'|<T- 2 n|*e-i(— 1 »*) !rp ("- 1 » a 5. (67) 

After obtaining the posterior samples of the hyper-parameters, we sample the posteriors of and u using 
conditional distributions identical to those listed in Text S2 by setting r = 1. Finally, we sample p, based 
on the conditional distribution P(p\z, 7, 0, u): 

/i|z, 7 , 0, u ~ N(-l£(z - X 7( 3 7 - u), -). (68) 
n ' n 

For efficient calculation of the above marginal likelihood function P(z\h, p, tt, 7), we use the same 
strategy as described in Text S2. However, as a transformation of the latent vector z to U T z as well 
as transformations of U T u and U T X/3 back to u and X/3 are needed in every Gibbs iteration, the per- 
iteration computational cost of the probit BSLMM increases quadratically with the number of individuals. 

Application to Mouse Data 

To generate a binary data set on which to illustrate the probit BSLMM and compare its performance 
with BSLMM, we use the mouse data from the main text, transforming the quantitative values of the 
three traits into binary values by assigning the half individuals with higher quantitative values to 1 and 
the other half to 0. We consider two different approaches here: (linear) BSLMM and probit BSLMM. The 
BSLMM can be viewed as a first order approximation to its probit counterpart. We use Brier score in 
the test sample to evaluate prediction performance. For BSLMM, we threshold the predicted probability 
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values that are above 1 to be exact 1 and those below to be exact 0. We contrast the performance 
of the probit BSLMM against BSLMM by calculating the Brier score difference, where a positive value 
indicates worse performance than BSLMM. 

Figure [S6] shows Brier score differences for the three traits. Interestingly, for the three traits here, 
treating binary values as quantitative traits using BSLMM works better than modeling them directly 
using the probit BSLMM. This may partly reflect numerical inaccuracies due to the greater computational 
burden of fitting the probit BSLMM. 



Correction factor for estimating PVE for case-control data 

Here, we provide an alternative way to derive the correction factor, that appeared in [24], for transforming 
PVE estimate in the observed scale back to that in the liability scale. Our approach is based on Taylor 
series approximation. To simplify notation, wc denote k p = P p (yi = 1) as the case proportion in the 
population, k a = P a {Vi = 1) as the case proportion in the ascertained case-control sample, $ as the 
normal CDF (cumulative distribution function), <j) as the normal PDF (probability distribution function), 
Hp satisfies &((i p ) = k p , Ha satisfies 3>(/U ) = k a , and z p = 4>(hp)- 

First, we assume, following [24], a probit model on the population scale: 

P p (y l = l\xi,0,Ui) = 1 - P P {y l = 0|xi,j9,Ui) = ®(h p + x t /3 + u t ) (i = 1, • • • , N), (69) 

where N is the population sample size. 

The conditional distribution in the ascertained case-control sample can be derived by Baycs theorem 

Pa\Vi = lfa, P,Ui) = — -~- (i = l,---,n), (70) 

P a (xi,Ui\f3) 

where n is the case-control sample size. 

We notice that P a (x-i,Ui\yi = 1,(3) = P p (xi,Ui\yi = 1,(3) holds for ideal case-control studies, as cases 
in the ascertained sample are selected randomly from all cases in the population. Wc assume further 
P P (yi = 1\P) ~ P P (yi = 1) = k p and P a (j/i = 1\P) ~ P a (yi = 1) = k a , that the probability of being a 
case does not depend on parameters, an assumption commonly made (see e.g. |83j ) and likely hold when 
parameters are close to their true values. We further denote a normalizing constant Z = ^°I?£lL?fil|jj a nd 

P p (Xi,Ui\0) 

we have 

P a { yi = l\xi,p,Ui) w -^(jip + XiP + Ui), (71) 

and similarly 

P a {y i =0\x i ,P,u i ) w 1^1^(1 -$(Hp + XiP + Ui)), (72) 
which give the normalizing constant 

Z = ^$( A t P + *i~P + Ui) + i— ^(1 - <S>(ji p + ^P + «<))• (73) 

Kp 1 Kp 

We expand the above two likelihoods using Taylor series expansion with respect to Xj/3 + Ui at 0. If we 
use the linear term only for approximation, we obtain 

P a (y t = l\xi,0,Ui) « k a + Y 1 "^ (xi/B + Ui), (74) 

Kp \ 1 Kp) 

P a ( yi = O\xi,0,Ui) w 1 -k a - ka{ ) ~ ka J Zp (^ + u t ). (75) 
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In other words, the expected value of individual binary label can be approximated by 

E(yi) = P a {Vi = lte.ftuO « K + ka{ ) ~ ka J Zp (^ + u,), (76) 

which suggests using a linear mixed model to treat binary values as quantitative traits to infer the 
parameters. The estimated PVE on the observed scaling using a linear mixed model is 



- fcq(l-fcq) , 2 V(x/3 + M t ) _ ka(l-kg)z^ r/ o fcq(l - fcq>j 



where PVE Q is the PVE estimate on the observed scale, and PVE; is the true PVE on the liability scale 

Therefore, we can use the correctior 
scale back to that on the liability scale 



k (1 — k ) 

Therefore, we can use the correction factor -vrn — i \ i to transform the PVE estimate on the observed 

k a (l-k a )z£ 
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Figure 1. Comparison of PVE estimates from LMM (blue), BVSR (red) and BSLMM (purple) in two 
simulation scenarios. The x-axis show the number of causal SNPs (Scenario I) or the number of 
medium/small effect SNPs (Scenario II). Results are based on 20 replicates in each case. (A) (true 
PVE=0.2) and (C) (true PVE=0.6) show RMSE of PVE estimates. (B) (true PVE=0.2) and (D) (true 
PVE=0.6) show boxplots of PVE estimates, where the true PVE is shown as a horizontal line. Notice a 
break point on the y-axis in (C). 
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Figure 2. Comparison of prediction performance of LMM (blue), BVSR (red) and BSLMM (purple) in 
two simulation scenarios, where all causal SNPs are included in the data. Performance is measured by 
Relative Predictive Gain (RPG). True PVE=0.6. Means and standard deviations (error bars) are based 
on 20 replicates. The x-axis show the number of causal SNPs (Scenario I) or the number of 
medium/small effect SNPs (Scenario II). 
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Figure 3. Comparison of prediction performance of LMM (blue), BVSR (red) and BSLMM (purple) 
for seven diseases in the WTCCC data set. Performance is measured by area under the curve (AUC), 
where a higher value indicates better performance. The order of the diseases is based on the 
performance of BSLMM. The mean and standard deviation of AUC scores for BSLMM in the seven 
diseases are 0.60 (0.02) for HT, 0.60 (0.03) for CAD, 0.61 (0.03) for T2D, 0.65 (0.02) for BD, 0.68 (0.02) 
for CD, 0.72 (0.01) for RA, 0.88 (0.01) for T1D. 



37 



CD8 MCH BMI 
I 1 I II 1 




Figure 4. Comparison of prediction performance of several models with BSLMM for three traits in the 
heterogenous stock mouse data set. Performance is measured by RMSE difference with respect to 
BSLMM, where a positive value indicates worse performance than BSLMM. The x-axis shows two 
different ways to split the data into a training set and a test set, each with 20 replicates. The mean 
RMSE of BSLMM for the six cases are 0.70, 0.80, 0.79, 0.90, 0.98 and 0.99, respectively. 
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Figure SI. Comparison of prediction performance of LMM (blue), BVSR (red) and BSLMM (purple) 
in two simulation scenarios, where all causal SNPs are excluded from the data. Performance is 
measured by Relative Predictive Gain (RPG). True PVE=0.6. Means and standard deviations (error 
bars) are based on 20 replicates. The x-axis show the number of causal SNPs (Scenario I) or the 
number of medium/small effect SNPs (Scenario II). 
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Figure S2. Comparison of prediction performance of LMM (blue), BVSR (red) and BSLMM (purple) 
in two simulation scenarios, where all causal SNPs are included in the data. Performance is measured 
by correlation. True PVE=0.6. Means and standard deviations (error bars) are based on 20 replicates. 
The x-axis show the number of causal SNPs (Scenario I) or the number of medium/small effect SNPs 
(Scenario II). 
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Figure S3. Comparison of prediction performance of LMM (blue), BVSR (red) and BSLMM (purple) 
in two simulation scenarios, where all causal SNPs are excluded from the data. Performance is 
measured by correlation. True PVE=0.6. Means and standard deviations (error bars) are based on 20 
replicates. The x-axis show the number of causal SNPs (Scenario I) or the number of medium/small 
effect SNPs (Scenario II). 



41 




Figure S4. Comparison of prediction performance of LMM (blue), BVSR (red) and BSLMM (purple) 
for seven diseases in the WTCCC data set. Performance is measured by Brier score, where a lower score 
indicates better performance. The order of the diseases is based on the performance of BSLMM. The 
mean and standard deviation of Brier scores for BSLMM in the seven diseases are 0.232 (0.004) for HT, 
0.230 (0.004) for CAD, 0.230 (0.003) for T2D, 0.222 (0.004) for BD, 0.211 (0.004) for CD, 0.204 (0.004) 
for RA, 0.139 (0.006) for T1D. 
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Figure S5. Comparison of prediction performance of several models with BSLMM for three traits in 
the heterogenous stock mouse data set. Performance is measured by correlation difference with respect 
to BSLMM, where a positive value indicates better performance than BSLMM. The x-axis shows two 
different ways to split the data into a training set and a test set, each with 20 replicates. The mean 
correlation of BSLMM for the six cases are 0.72, 0.61, 0.61, 0.47, 0.21 and 0.14, respectively. 
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Figure S6. Comparison of prediction performance between BSLMM and probit BSLMM for three 
binary traits in the heterogenous stock mouse data set. Performance is measured by Brier score 
difference with respect to BSLMM, where a positive value indicates worse performance than BSLMM. 
The x-axis shows two different ways to split the data into a training set and a test set, each with 20 
replicates. The mean Brier scores of BSLMM for the six cases are 0.185, 0.205, 0.201, 0.236, 0.245 and 
0.249, respectively. 
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Tables 

Table 1. Summary of some effect size distributions that have been proposed for polygenic modeling. 



Eff 

Name 


ect Size Distribution 

Formula 


Keywords 


Selected References 


t 




BayesA 


27 32 ,33, 40 


point-t 


ft ~ 7Tt(0, V, O + (1 - 7t)5q 


BayesB, BayesD, BayesD-7r 


[ITllMElllQ] 


t mixture 


ft ~7Tt(0,I/,^) 
+ (l-7T)t(0,l/,0.01o2) 


BayesC 


[sai33] 


point-normal 


ft ~7rN(0,a2) + (l-7r)(5o 


BayesC, BayesCvr, BVSR 


ESES1I35I 


double exponential 


ft ~ DE(0, 0) 


Bayesian Lasso 


[28j[39ll68] 


point-normal mixture 


ft ~ 7TlN(0, CT^) + 7T 2 N(0, O.lo-S) 
+7r 3 N(0, O.Olcr^) + (1 - TTi - 7T2 - 7T 3 )5o 


BayesR 


m\ 


normal 


ft ~N(0,<^) 


LMM, BLUP, Ridge Regression 


I22II26II28II48I 


normal-exponential 
-gamma 


ft ~ NEG(O,k,0) 


NEG 


[16] 


normal mixture 


ft ~7rN(0,(7S+<7j) 

+ (I-^)N(0,a b 2 ) 


BSLMM 


Present Work 



The reference list contains only a selection of relevant publications. Abbreviations: DE denotes double 
exponential distribution, NEG denotes normal exponential gamma distribution, and other abbreviations can 
be found in the main text. In the scaled t-distribution, v and ^ are the degree of freedom parameter and 
scale parameter, respectively. In the DE distribution, 6 is the scale parameter. In the NEG distribution, n and 
9 are the shape and scale parameters, respectively. Notes: 1. Some applications of these methods combine 
a particular effect size distribution with a random effects term, with covariance matrix K, to capture sample 
structure or relatedness. If K oc XX T then this is equivalent to adding a normal distribution to the effect size 
distribution. The listed effect size distributions in this table do not include this additional normal component. 
2. BayesC has been used to refer to models with different effect size distributions in different papers. 3. In some 
papers, keywords listed here have been used to refer to fitting techniques rather than effect size distributions. 
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Table 2. PVE and PGE estimates for five human traits. 

Method Height HDL LDL TC TG 

LMM 0.42 (0.08) 0.38 (0.15) 0.22 (0.18) 0.22 (0.17) 0.34 (0.17) 

PVE BVSR 0.15 (0.03) 0.06 (0.01) 0.10 (0.08) 0.15 (0.07) 0.05 (0.06) 
BSLMM 0.41 (0.08) 0.34 (0.14) 0.22 (0.14) 0.26 (0.14) 0.29 (0.17) 

PGE BSLMM 0.12 (0.13) 0.21 (0.14) 0.27 (0.26) 0.46 (0.30) 0.18 (0.20) 

PVE estimates are obtained using LMM, BVSR and BSLMM, while PGE estimates are obtained 
using BSLMM. Values in parentheses are standard error (for LMM) or standard deviation of posterior 
samples (for BVSR and BSLMM). n = 3,925,p = 294,831 for height, and n = l,868,p = 555,601 for 
other four traits. 



4G 



Table 3. Mean computation time, in hours, of various methods for the mouse data set. 



Method 


Time in hrs (sd) 


LMM 


0.007 (0.002) 


LMM-Bayes 


0.14 (0.02) 


BSLMM-EB 


2.44 (3.52) 


BSLMM 


3.86 (4.13) 


BVSR 


9.20 (6.36) 


BaycsC7r 


39.4 (11.7) 


BaycsA-Flex 


68.6 (8.06) 


Bayesian Lasso 


78.6 (23.4) 



Values in parentheses arc standard deviations. Means and standard deviations are calculated 
based on 2.1 million MCMC iterations in 120 replicates: 20 intra-family and 20 inter-family splits for 
three phenotypes. Computations were performed on a single core of an Intel Xeon L5420 2.50 GHz 
CPU. Since computing times for many methods will vary with number of iterations used, and we did not 
undertake a comprehensive evaluation of how many iterations suffice for each algorithm, these results 
provide only a very rough guide to the relative computational burden of different methods. 
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Table 4. Mean computation time, in hours, of BVSR and BSLMM in all examples used in the present 
study. 



Simulations, PVE = 0.2 






Number of Causal SNPs 






10 


100 


1000 10000 


10/10000 


100/10000 


BVSR 
BSLMM 


8.25 (1.11) 
8.44 (2.04) 


50.0 (17.3) 
16.9 (4.06) 


46.3 (25.0) 32.8 (25.0) 
5.44 (0.23) 5.19 (0.20) 


50.6 (23.9) 
29.3 (17.2) 


35.2 (24.7) 
37.5 (19.1) 


Simulations, PVE = 0.6 






Number of Causal SNPs 






10 


100 


1000 10000 


10/10000 


100/10000 


BVSR 
BSLMM 


7.43 (0.72) 
7.34 (1.05) 


36.9 (5.71) 
41.2 (5.12) 


118 (8.94) 96.1 (12.0) 
39.8 (15.0) 5.13 (0.34) 


75.4 (28.9) 

13.5 (5.52) 


92.8 (15.0) 

45.9 (16.6) 



Real Data 


BD CAD 


CD 


Diseases 
HT 


RA 


T1D 


T2D 


BVSR 


131 (20.3) 110 (20.3) 


105 (13.9) 


114 (21.0) 


12.4 (3.52) 


14.8 (2.47) 


123 (22.0) 


BSLMM 


21.7 (13.6) 22.0 (16.2) 


57.6 (24.6) 


32.8 (23.8) 


9.58 (1.17) 


14.5 (2.52) 


38.7 (20.3) 


Real Data 




Quantitative Traits 








Height HDL LDL 


TC TG 


CD8 


MCH 


BMI 




BVSR 
BSLMM 


96.1 2.94 21.4 
77.4 3.35 7.24 


24.4 18.8 
24.0 6.53 


5.89 (1.87) 
1.14 (1.05) 


10.5(5.83) 
2.45(3.13) 


11.2(8.30) 
7.97(3.79) 





Computations were performed on a single core of an Intel Xeon L5420 2.50 GHz CPU, with 2.1 million 
MCMC iterations. Values in parentheses are standard deviations. Means and standard deviations are calculated 
based on 20 replicates in simulations, 20 replicates in the WTCCC data set, and 40 replicates - 20 intra-family 
and 20 inter- family splits - in the mouse data set. 



48 



Table SI. Estimates of PVE, PGE and log 10 (7r) for CD8, MCH and BMI in the mouse data set. PVE 
estimates are obtained using LMM, BVSR and BSLMM, log 10 (7r) estimates are obtained using BVSR 
and BSLMM, and PGE estimates are obtained using BSLMM. Values in parentheses are standard error 
(for LMM) or standard deviation of posterior samples (for BVSR and BSLMM). n = 1, 410, p = 10, 768 
for CD8, n = 1, 580, p = 10, 744 for MCH, and n = 1, 828, p = 10, 771 for BMI. 





Method 


CD8 


MCH 


BMI 


PVE 


LMM 
BVSR 
BSLMM 


0.61 (0.03) 
0.66 (0.02) 
0.64 (0.02) 


0.64 (0.03) 
0.57 (0.02) 
0.61 (0.03) 


0.14 (0.03) 
0.13 (0.02) 
0.13 (0.02) 


PGE 


BSLMM 


0.50 (0.09) 


0.43 (0.06) 


0.40 (0.28) 


iogioO) 


BVSR 


-1.69 (0.12) 


-1.62 (0.09) 


-1.52 (0.21) 


BSLMM 


-2.68 (0.24) 


-2.77 (0.19) 


-2.44 (0.70) 



